Comparing common and rare single variant and gene aggregate instrumentation strategies for molecular MR
Author
Aimee Hanson
Published
February 2, 2026
Introduction
Classically, Mendelian Randomisation (MR) methods utilising trait-associated genetic variants from GWAS studies have employed common polymorphisms (population MAF > 1%) targeted by genotyping chips, or reliably imputed from reference populations, to instrument modifiable exposures. However, common variants tested in GWAS typically explain a very small fraction of the variability in a measured trait, potentially exhibit pleiotropic effects acted upon by balancing selection or as a consequent of genetic linkage, and are rarely causal. Rare variants, which typically show large biological effects (e.g. through abolishing protein expression) provide a means of more unambiguously instrumenting relevant molecular processes. Comparison of causal estimates derived using differing methods of genetically instrumenting modifiable exposures may enhance the interpretation of the biological mechanisms underlying exposure-outcome relationships. This includes using variants from across the allele frequency spectrum, but also leveraging rare variant aggregate approaches to instrument gene-level perturbations in expression and function.
Here, caual estimates for pairwises combinations of 2937 molecular exposures (O-link plasma proteins) and 11 health related compelex traits (below) have been derived using nine instrumentation strategies, with further instrument splitting into cis (variants within 1MB of the encoding gene start and end site), trans and full variant sets (a total of 27 instrument sets per exposure).
Outcomes: Low density lipoprotein levels (LDL), Body Mass Index (BMI), Vitamin D (VitD), Triglycerides (Trig), Glycated Haemoglobin (HbA1c), Mean Platelet Volume (PlatVol), IGF-1 (IGF1), Waist-to-Hip Ratio (WHRadjBMI), Red Blood Cell Erythrocyte Count (RBCCount), Mean Corpuscular Volume (CorpVol) and Systolic Blood Pressure (SBP)
Instrumental variable (IV) sets
Nine sets of IVs for each protein exposure have been extracted from across two studies:
GWAS common variants: “Plasma proteomic associations with genetics and health in the UK Biobank” (Sun, B. 2023, Nature).
WES variants/gene aggregate tests: “Rare variant associations with plasma protein levels in the UK Biobank” (Dhindsa, R. 2023, Nature).
Common GWAS
Common genome-wide fine-mapped variants from the GPMAP (>1% MAF) (cis, trans and combined)
Common genome-wide clumped and pruned variants (>1% MAF) (cis, trans and combined)
WES (variants)
Common exome-wide (>1% MAF) (cis, trans and combined)
Rare exome-wide (0-1% MAF) (cis, trans and combined)
Ultra-rare exome-wide (0-0.1% MAF) (cis, trans and combined) LD clumping was performed for common variants (for which LD matrices have been derived) only. For rare variants, instrument sets were stored both pre and post filtering to the top associated variant per gene for sensitivity analyses.
WES (masks)
pLOF burden mask (cis, trans and combined)
missense/low-confidence burden mask (cis, trans and combined)
pLOF/missense/low-confidence burden mask (cis, trans and combined)
synonymous burden mask (cis, trans and combined)
Data import
For direct comparison of the performance of instrument sets, the theshold for pQTL significance has been set at 5x10^-8 for both rare and common variants. MR results for all exposure-outcome pairs, derived using differing IV sets, and the harmonised summary statistics of those variant sets, are read in below.
Read in MR results from scripts/04_analysis_proteins/001_performmolecularMR.R and harmonised summary statistics:
The average number of genetic instruments (pQTLs) per protein across instrument sets
Rare and common variant instruments (pQTLs) have been partitioned according to their proximity to the instrumented protein-coding gene into cis (+- 1MB from gene boundaries) and trans (>1MB distant) sets. A similar number of cis-IVs per protein can be derived from each of the IV sets (typically <5). Proteins that are instrumented with >50 rare cis variants fall in gene dense genomic regions. Available IVs drop when restricting to one rare variant per annotated gene, but rare IVs have not been explicitly LD pruned so it is possible that in some instances these variants are not independent. More trans-IVs are availble in the common-varaint instrument sets (both GWAS and ExWAS derived), suggesting there are fewer rare and ultra-rare variants that associate with protein levels in trans.
Code
# Number of instruments per protein for each IV setnum_IVs <-lapply(res_MR, function(x){ df_out <-data.frame(expand.grid(IV_set = IV_sets, cis_trans =c("cis","trans")))for(i in1:length(x)){ df <- x[[i]] |> dplyr::summarise(unique(nsnp), .by =c(IV_set, cis_trans))colnames(df)[3] <-names(x[i]) df_out <-left_join(df_out, df, by =c("IV_set","cis_trans")) } df_out <- df_out |> tidyr::pivot_longer(cols =3:length(df_out)) |>na.exclude()return(df_out)})num_IVs <-do.call("rbind", num_IVs) |> dplyr::mutate(outcome =sub(".*_","",name))# Plotp1 <-mapply(FUN =function(x,y){ x <- x |> dplyr::filter(cis_trans == y) |> dplyr::mutate(label =factor(IVsets[IV_set], levels = IVsets)) |> dplyr::filter(label %in%grep("exome|genome", IVsets, value = T))ggplot(x, aes(x = label, y = value, fill = label)) +facet_grid(.~outcome) +stat_boxplot(geom ='errorbar', width =0.2) +geom_boxplot(width =0.7) +scale_y_log10() +scale_fill_manual(values = cols_IVsets) +theme_minimal() +theme(axis.text.x =element_text(angle =55, hjust =1),legend.position ="none",plot.title =element_text(hjust =0.5)) +labs(x ="IV set",y ="Num. instruments", title =paste0("Number of ",y ," instruments per plasma protein"))}, x =list(num_IVs,num_IVs), y =c("cis","trans"), SIMPLIFY =FALSE)gridExtra::grid.arrange(grobs = p1)
Figure 1
The number of proteins able to be instrumented with pQTLs across instrument sets
Note, only instrument sets including >=3 independent variants are considered.
Although the power to detect highly significant rare-variant associations is expected to be lower than that for common variants due to to low alternate allele counts, an approximately equivalent percentage (~25%) of the plasma protein can be instrumented using GWAS common or ExWAS rare variants in cis (and ~10-12% with ultra-rare variants). ExWAS common variants provide less coverage, likely because fewer independent cis-variants will typically be found in the coding region of a gene than within the 1MB surrounding putative regulatory region captured by GWAS. Alternatively, over 50% of the proteome can be instrumented with ExWAS and GWAS derived common variants relative to <10% with rare and ultra-rare variants. This again suggests there are fewer (or less readily detected) rare trans-variant associations with plasma protein levels. This is slightly counterintuitive, as it would be anticipated that stronger rare varaint cis-effects on protein levels would also be detected as strong trans associations with proteins in interacting pathways.
What are some potential explanations for this?:
- The common variants that exist at intermediate frequencies in a population (i.e. have not reached fixation) are acted upon by competing balancing selective forces and show a greater number of trans assocaitions across diverse biological processes (i.e. are more plieotropic). Converse to this is the idea that rare variants can only be observed in genomic regions where the DNA sequence has been evolutionarially optimised (conserved) for a highly specific biological mechanism.
- The nature of the genes instrumented by common and rare variants is fundamentally different. For example the proteins instrumented by common cis-IVs are more often involved in the regulation of diverse cellular pathways (e.g. variants in enhancers impacting transcriptional or cell signalling regulation), where strong-effect rare variants modify protein levels or function directly, with consequences for a smaller numebr of interacting/downstream proteins?
Questions:
- What is the relative strength (betas) of cis and trans, rare and common pQTL effects?
- What is the intersection of the plasma proteins that have rare and common IVs available (significant pQTL associations)?
On average, how many molecular traits (protein levels) does a variant significantly associate with across instrument sets (e.g. ultra-rare, rare, common)?
To assess the pleiotropy profiles of variants in each IV set, all significant pQTL associations reported with a given variant (p<=5e-8) were extracted from the O-link protein GWAS/ExWAS summary statistics.
Code
lookup_files <-list.files(file.path(data_dir, "sumstats/proteomics"), pattern ="pQTL_lookup_*", full.names =TRUE)lookup_masks_files <-list.files(file.path(data_dir, "sumstats/proteomics"), pattern ="pQTL_lookup_dhindsa_exwas_mask_*", full.names =TRUE)lookup_exome_files <-list.files(file.path(data_dir, "sumstats/proteomics"), pattern ="pQTL_lookup_dhindsa_exwas_variant_*", full.names =TRUE)lookup_genome_files <-list.files(file.path(data_dir, "sumstats/proteomics"), pattern ="pQTL_lookup_sun_gwas_variant_*", full.names =TRUE)# Pleiotropy matrices (for each IV set)# How many proteins do IVs associated with?lookup_pQTLs <-lapply(lookup_files, function(x){data.table::fread(x)})names(lookup_pQTLs) <-sub(".*pQTL_lookup_(.*)\\.tsv","\\1",lookup_files)# Format to keep significant variant-protein associationslookup_pQTLs <-lapply(lookup_pQTLs, function(x){ x <-as.data.frame(x)if(!("GENE"%in%colnames(x))){x$GENE <-NA} out <-data.frame(variant =sub(".*\\.tsv:","",x[,1]),protein_target = x$protein_target,proximal_gene = x$GENE,pvalue = x[,which(colnames(x) %in%c("P","pValue"))],cis_trans = x$cis_trans) |> dplyr::arrange(variant)return(out)})# Retain associations at 5x10^-8 (note gene masks are already thresholded at 1x10^-8)lookup_pQTLs <-lapply(lookup_pQTLs, function(x){ x |> dplyr::filter(pvalue <=5e-8&!(cis_trans =="")) # Exclude composite protein targets with no cis_trans assignment})# Summarise associations per variant (pQTL)pQTLs_assocs <-mapply(function(x, y){ x <- x |> dplyr::group_by(variant, proximal_gene) |> dplyr::summarise(n_assocs =n(), n_cis =length(which(cis_trans =="cis")),n_trans =length(which(cis_trans =="trans"))) x$IV_set <- yreturn(x)}, x = lookup_pQTLs, y =as.list(names(lookup_pQTLs)), SIMPLIFY = F)
Code
pqtls_assocs_joined <-do.call("rbind", pQTLs_assocs) |> dplyr::filter(grepl("variant",IV_set)) |> dplyr::mutate(IV_set =factor(IVsets[IV_set], levels = IVsets)) |> dplyr::group_by(IV_set) |> dplyr::mutate(label =ifelse(n_trans ==max(n_trans) &grepl("exome",IV_set), paste(variant,proximal_gene), NA)) |> dplyr::ungroup()p5 <- pqtls_assocs_joined |>ggplot(aes(x = IV_set, y = n_assocs)) +geom_boxplot(aes(fill = IV_set)) +scale_fill_manual(values = cols_IVsets) +scale_y_log10() +theme_minimal() +theme(axis.text.x =element_text(angle =55, hjust =1),legend.position ="none") +labs(x ="", y ="Number of pQTL associations")p6 <- pqtls_assocs_joined |>ggplot(aes(x = n_cis, y = n_trans, colour = IV_set)) +facet_wrap(.~IV_set, nrow =2) +geom_point() + ggrepel::geom_text_repel(aes(label = label), size =3) +scale_colour_manual(values = cols_IVsets) +theme_minimal() +theme(legend.position ="none") +labs(x ="Number of cis-pQTL associations", y ="Number of trans-pQTL associations")gridExtra::grid.arrange(p5,p6,nrow =1, widths =c(0.2,0.7), top ="Pleiotropy profile of IV sets")
Causal effect estimates
Proportion of instrumented proteins with significant IVW causal effects detected across instrument sets
Restricting to only those proteins that could be successfully instrumented (with n >= 3 IVs) by all strategies shown below, a significant causal effect of protein levels on a given outcome tend to be more often detected using common cis-IVs from GWAS than rare and ultra-rare cis-IVs. Here, significance has been defined at two thresholds, the liberal threshold of p<=0.05 and the conservative Bonferroni adjusted significance threshold given n=2937 independent proteins tested (p<=1.7x10^-5).
Code
# Calculate the proportion of significant exposures across variant sets# Keep proteins instrumentable by all variant-based IV setsprop_sig_IVs <-function(res_summary, p_thresh){ df_p <- res_summary$min_p[,names(variant_IVsets_indep)] |>na.exclude() |>as.data.frame() |> tibble::rownames_to_column("row") |> dplyr::mutate(gene =sub("_.*","",row),outcome =sub(".*_","",row)) |> tidyr::pivot_longer(2:6, names_to ="IV_set", values_to ="pval") df_beta <- res_summary$min_p_beta[,names(variant_IVsets_indep)] |>na.exclude() |>as.data.frame() |> tibble::rownames_to_column("row") |> dplyr::mutate(gene =sub("_.*","",row),outcome =sub(".*_","",row)) |> tidyr::pivot_longer(2:6, names_to ="IV_set", values_to ="beta") df <-left_join(df_p, df_beta) |> dplyr::mutate(call =ifelse(pval <= p_thresh & beta >0, "sig_up",ifelse(pval <= p_thresh & beta <0, "sig_down", "ns"))) df_grouped <- df |> dplyr::group_by(IV_set,call) |> dplyr::summarise(n_genes =n()) |> dplyr::mutate(prop = n_genes/sum(n_genes) *100,outcome =unique(df$outcome),n_genes_tot =sum(n_genes))return(df_grouped)}sig_IVs <-do.call("rbind", lapply(summary_sig, prop_sig_IVs, p_thresh = alpha))sig_IVs_cis <-do.call("rbind", lapply(summary_sig_cis, prop_sig_IVs, p_thresh = alpha))sig_IVs_trans <-do.call("rbind", lapply(summary_sig_trans, prop_sig_IVs, p_thresh = alpha))sig_IVs_0.01<-do.call("rbind", lapply(summary_sig, prop_sig_IVs, p_thresh =0.01))sig_IVs_0.01_cis <-do.call("rbind", lapply(summary_sig_cis, prop_sig_IVs, p_thresh =0.01))sig_IVs_0.01_trans <-do.call("rbind", lapply(summary_sig_trans, prop_sig_IVs, p_thresh =0.01))# Plotp3 <-mapply(function(x,y){ x |> dplyr::filter(call %in%c("sig_up","sig_down")) |> dplyr::mutate(label =factor(variant_IVsets_indep[IV_set], levels = variant_IVsets_indep)) |>ggplot(aes(x = label, y = prop, fill = call)) +facet_grid(.~outcome) +geom_bar(position ="stack", stat ="identity") +scale_fill_manual(name ="Direction of effect", values =c("steelblue","tomato"), label =c("down","up")) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.5),legend.position ="bottom",plot.margin =margin(t =5, r =5, b =5, l =30)) +labs(x ="IV set",y ="%", title =paste0("Percentage of plasma protein causally associated with outcome using ", y, "IVs (at P<=0.01)"))}, x =list(sig_IVs_0.01, sig_IVs_0.01_cis, sig_IVs_0.01_trans), y =c("all ","cis-","trans-"), SIMPLIFY =FALSE)gridExtra::grid.arrange(grobs = p3)p4 <-mapply(function(x,y){ x |> dplyr::filter(call %in%c("sig_up","sig_down")) |> dplyr::mutate(label =factor(variant_IVsets_indep[IV_set], levels = variant_IVsets_indep)) |>ggplot(aes(x = label, y = prop, fill = call)) +facet_grid(.~outcome) +geom_bar(position ="stack", stat ="identity") +scale_fill_manual(name ="Direction of effect", values =c("steelblue","tomato"), label =c("down","up")) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),plot.title =element_text(hjust =0.5),legend.position ="bottom",plot.margin =margin(t =5, r =5, b =5, l =30)) +labs(x ="IV set",y ="%", title =paste0("Percentage of plasma protein causally associated with outcome using ", y, "IVs (at P<=1.7x10^-5)"))}, x =list(sig_IVs, sig_IVs_cis, sig_IVs_trans), y =c("all ","cis-","trans-"), SIMPLIFY =FALSE)gridExtra::grid.arrange(grobs = p4)
Figure 3
Figure 4
Comparison of IVW effect estimates across instrument sets
Code
# Functions# Generate a table of IVW estimates for a pair of instrument setscompare_estimates <-function(results_MR, IVset1, IVset2, n_snps=3, cis_trans="all"){# Pull out IVW estimates and SE for instrument setsif(cis_trans =="all"){ df <-lapply(results_MR, function(x){ x |> dplyr::filter(method =="Inverse variance weighted", cis_trans =="cis_trans", IV_set %in%c(IVset1,IVset2)) }) }if(cis_trans =="cis"){ df <-lapply(results_MR, function(x){ x |> dplyr::filter(method =="Inverse variance weighted", cis_trans =="cis", IV_set %in%c(IVset1,IVset2)) }) }if(cis_trans =="trans"){ df <-lapply(results_MR, function(x){ x |> dplyr::filter(method =="Inverse variance weighted", cis_trans =="trans", IV_set %in%c(IVset1,IVset2)) }) } df_out_b <-do.call("rbind",df) |> dplyr::filter(nsnp >= n_snps) |> tidyr::pivot_wider(names_from ="IV_set", values_from =c("b"), id_cols ="exposure") |>na.exclude() |> dplyr::relocate(exposure,IVset1,IVset2) df_out_se <-do.call("rbind",df) |> dplyr::filter(nsnp >= n_snps) |> tidyr::pivot_wider(names_from ="IV_set", values_from ="se", id_cols ="exposure") |>na.exclude() |> dplyr::relocate(exposure,IVset1,IVset2) df_out_p <-do.call("rbind",df) |> dplyr::filter(nsnp >= n_snps) |> tidyr::pivot_wider(names_from ="IV_set", values_from ="pval", id_cols ="exposure") |>na.exclude() |> dplyr::relocate(exposure,IVset1,IVset2)colnames(df_out_p)[2:3] <-paste0(colnames(df_out_p)[2:3],".pval")## Caluclate effect heterogeneityif (!(identical(df_out_b$exposure, df_out_se$exposure) &&identical(colnames(df_out_b), colnames(df_out_se)))) {message("Error - mismatch of beta and SE dataframes")return(NULL) } beta_vecs <-lapply(split(df_out_b[,c(2,3)], 1:nrow(df_out_b)), as.numeric) se_vecs <-lapply(split(df_out_se[,c(2,3)], 1:nrow(df_out_se)), as.numeric) het_res <-do.call("rbind", mapply(effect_heterogeneity, beta_vec = beta_vecs, se_vec = se_vecs, SIMPLIFY = F)) |>as.data.frame() |> dplyr::mutate(exposure = df_out_b$exposure, .before ="Q") |> dplyr::mutate(Qpval_bon =p.adjust(Qpval, method ="bonferroni"),Qpval_fdr =p.adjust(Qpval, method ="fdr")) df_out <- dplyr::left_join(df_out_b, df_out_se, by ="exposure", suffix =c(".beta",".se")) |> dplyr::left_join(df_out_p, by ="exposure") |> dplyr::left_join(het_res, by ="exposure")return(df_out)}# Generate a table of cis vs trans IVW estimates for given instrument setcompare_cistrans <-function(results_MR, IVset, n_snps=3){# Pull out IVW estimates and SE for instrument sets df <-lapply(results_MR, function(x){ x |> dplyr::filter(method =="Inverse variance weighted", IV_set %in% IVset, nsnp >= n_snps, cis_trans %in%c("cis","trans")) |> dplyr::mutate(IV_set =paste(IV_set, cis_trans, sep ="_")) |> dplyr::arrange(cis_trans)}) df <- df[lapply(df, nrow) ==2] df_out_b <-do.call("rbind",df) |> tidyr::pivot_wider(names_from ="IV_set", values_from ="b", id_cols ="exposure") |>na.exclude() df_out_se <-do.call("rbind",df) |> tidyr::pivot_wider(names_from ="IV_set", values_from ="se", id_cols ="exposure") |>na.exclude() df_out_p <-do.call("rbind",df) |> tidyr::pivot_wider(names_from ="IV_set", values_from ="pval", id_cols ="exposure") |>na.exclude()colnames(df_out_p)[2:3] <-paste0(colnames(df_out_p)[2:3],".pval")## Caluclate effect heterogeneityif (!(identical(df_out_b$exposure, df_out_se$exposure) &&identical(colnames(df_out_b), colnames(df_out_se)))) {message("Error - mismatch of beta and SE dataframes")return(NULL) } beta_vecs <-lapply(split(df_out_b[,c(2,3)], 1:nrow(df_out_b)), as.numeric) se_vecs <-lapply(split(df_out_se[,c(2,3)], 1:nrow(df_out_se)), as.numeric) het_res <-do.call("rbind", mapply(effect_heterogeneity, beta_vec = beta_vecs, se_vec = se_vecs, SIMPLIFY = F)) |>as.data.frame() |> dplyr::mutate(exposure = df_out_b$exposure, .before ="Q") |> dplyr::mutate(Qpval_bon =p.adjust(Qpval, method ="bonferroni"),Qpval_fdr =p.adjust(Qpval, method ="fdr")) df_out <- dplyr::left_join(df_out_b, df_out_se, by ="exposure", suffix =c(".beta",".se")) |> dplyr::left_join(df_out_p, by ="exposure") |> dplyr::left_join(het_res, by ="exposure")return(df_out)}plot_estimates <-function(estimates, name){ estimates <-as.data.frame(estimates) col_names <-names(estimates)if(any(grepl("cis|trans",col_names))){ lab_x <-"cis-IVs" lab_y <-"trans-IVs" } else { lab_x <- IVsets[sub("\\.beta","",col_names[2])] lab_y <- IVsets[sub("\\.beta","",col_names[3])] } CI_x <-data.frame(LCI = estimates[,2] -1.96*estimates[,4], UCI = estimates[,2] +1.96*estimates[,4]) CI_y <-data.frame(LCI = estimates[,3] -1.96*estimates[,5], UCI = estimates[,3] +1.96*estimates[,5]) mod_lm <-lm(estimates[,3] ~ estimates[,2]) slope <-round(coefficients(mod_lm)[2],2) flag_point <- estimates$Qpval_bon <=0.05 range_min <-min(min(estimates[,2]), min(estimates[,3])) #min(na.exclude(errorbar_xmin, errorbar_ymin))) range_max <-max(max(estimates[,2]), max(estimates[,3])) #min(na.exclude(errorbar_xmax, errorbar_ymax))) estimates_plot <- estimates |> dplyr::mutate(exposure =factor(exposure, levels = estimates$exposure[order(estimates$Qpval, decreasing =TRUE)]),errorbar_xmin =ifelse(flag_point, CI_x$LCI, NA),errorbar_xmax =ifelse(flag_point, CI_x$UCI, NA),errorbar_ymin =ifelse(flag_point, CI_y$LCI, NA),errorbar_ymax =ifelse(flag_point, CI_y$UCI, NA)) |> dplyr::arrange(exposure) estimates_plot |>ggplot(aes_string(x = col_names[2], y = col_names[3])) +geom_abline(slope =1, intercept =0, colour ="steelblue", lty =2, lwd =0.5) +geom_point(aes(colour = estimates_plot$Qpval_bon <=0.05), size =2, shape =16) +geom_errorbar(xmin = estimates_plot$errorbar_xmin,xmax = estimates_plot$errorbar_xmax, colour ="tomato") +geom_errorbar(ymin = estimates_plot$errorbar_ymin,ymax = estimates_plot$errorbar_ymax, colour ="tomato") +geom_smooth(method ="lm", se =FALSE, colour ="tomato", lwd =0.5) +annotate(geom ="text", x = range_min/2, y = range_max, label =paste0("Slope=",slope), size =3, colour ="tomato") +scale_colour_manual(values =c("black","tomato")) +scale_x_continuous(limits =c(range_min, range_max)) +scale_y_continuous(limits =c(range_min, range_max)) +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =10), legend.position ="none") +labs(x=lab_x, y=lab_y, title = name)}# Upset plot comparing signifiant causal estimates across variant sets# Restrtcited to those proteins that could be instrumented with >= 3 variants across the comparison setsplot_upset <-function(IVsets_list, cis_trans ="all", pval =0.01){if(cis_trans =="all"){inlist = summary_sig}if(cis_trans =="cis"){inlist = summary_sig_cis}if(cis_trans =="trans"){inlist = summary_sig_trans} set_sig <-lapply(inlist, function(x){ dat <- x$min_p[,IVsets_list]# Extract only those tests that could be conducted in all IV sets tests <-rownames(dat) # Exposure_outcome shared_tests <- tests[apply(dat, MARGIN =1, function(x){sum(!(is.na(x)))}) ==2] dat <- dat[shared_tests,] dat_sig <- dat <= pval tests <-rownames(dat_sig) list_sig <-apply(dat_sig, MARGIN =2, FUN =function(x){tests[x]}, simplify = F) # Names of significant exposuresnames(list_sig) <- IVsets[names(list_sig)] list_sig <- list_sig[lapply(list_sig, length) >0] # Remove empty exposure setsreturn(list_sig) })# Upset plots plots <-mapply(FUN =function(list_sig, outcome){if(length(list_sig) ==1){return(NULL)}upset(fromList(list_sig), order.by ="freq", text.scale =1.5, mainbar.y.label =paste0("Protein:",outcome,"\n (IVW p<", pval, ")")) }, list_sig = set_sig, outcome =names(set_sig), SIMPLIFY = F) plots <- plots[!sapply(plots, is.null)]# for (v in names(plots)) {# print(plots[[v]])# #grid.text(v, x = 0.65, y=0.97, gp = gpar(fontsize = 20))# grid.edit('arrange', name = v)# vp <- grid.grab()# plots[[v]] <- vp# }return(plots)}
1a) Common variant cis-IVs (finemapped vs clumped)
1b) Common variant trans-IVs (finemapped vs clumped)
2a) Common variant cis-IVs (GWAS vs ExWAS)
2b) Common variant trans-IVs (GWAS vs ExWAS)
Code
comp_IVWs_common_1a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="sun_gwas_variant_common_finemapped",n_snps =3, cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_common_1a, name =names(comp_IVWs_common_1a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from common cis-pQTL instruments (finemapped vs clumped)")comp_IVWs_common_1b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="sun_gwas_variant_common_finemapped",n_snps =3, cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_common_1b, name =names(comp_IVWs_common_1b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from common trans-pQTL instruments (finemapped vs clumped)")comp_IVWs_common_2a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="sun_gwas_variant_common_clumped",n_snps =3, cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_common_2a, name =names(comp_IVWs_common_2a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES common cis-pQTL instrumentation strategies")comp_IVWs_common_2b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="sun_gwas_variant_common_clumped",n_snps =3, cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_common_2b, name =names(comp_IVWs_common_2b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES common trans-pQTL instrumentation strategies")
Figure 5
Figure 6
Figure 7
Figure 8
3a) Common vs rare variant cis-IVs (ExWAS)
3b) Common vs rare variant trans-IVs (ExWAS)
4a) Common vs ultra-rare variant cis-IVs (ExWAS)
4b) Common vs ultra-rare variant trans-IVs (ExWAS)
Code
comp_IVWs_3a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="dhindsa_exwas_variant_rare",n_snps =3,cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_3a, name =names(comp_IVWs_3a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from WES common and rare cis-pQTL instrumentation strategies")comp_IVWs_3b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="dhindsa_exwas_variant_rare",n_snps =3,cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_3b, name =names(comp_IVWs_3b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from WES common and rare trans-pQTL instrumentation strategies")comp_IVWs_4a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="dhindsa_exwas_variant_ultrarare",n_snps =3,cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_4a, name =names(comp_IVWs_4a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from WES common and ultra-rare cis-pQTL instrumentation strategies")comp_IVWs_4b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="dhindsa_exwas_variant_common",IVset2 ="dhindsa_exwas_variant_ultrarare",n_snps =3,cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_4b, name =names(comp_IVWs_4b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from WES common and ultra-rare trans-pQTL instrumentation strategies")
Figure 9
Figure 10
Figure 11
Figure 12
5a) Common (GWAS) vs rare variant cis-IVs (ExWAS)
5b) Common (GWAS) vs rare variant trans-IVs (ExWAS)
6a) Common (GWAS) vs ultra-rare variant cis-IVs (ExWAS)
6b) Common (GWAS) vs ultra-rare variant trans-IVs (ExWAS)
Code
comp_IVWs_5a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="dhindsa_exwas_variant_rare",n_snps =3,cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_5a, name =names(comp_IVWs_5a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES rare cis-pQTL instrumentation strategies")comp_IVWs_5b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="dhindsa_exwas_variant_rare",n_snps =3,cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_5b, name =names(comp_IVWs_5b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES rare trans-pQTL instrumentation strategies")comp_IVWs_6a <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="dhindsa_exwas_variant_ultrarare",n_snps =3,cis_trans ="cis")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_6a, name =names(comp_IVWs_6a), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES ultra-rare cis-pQTL instrumentation strategies")comp_IVWs_6b <-lapply(res_MR, FUN = compare_estimates, IVset1 ="sun_gwas_variant_common_clumped",IVset2 ="dhindsa_exwas_variant_ultrarare",n_snps =3,cis_trans ="trans")gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_6b, name =names(comp_IVWs_6b), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and WES ultra-rare trans-pQTL instrumentation strategies")# Extract list of genes with significant heterogenity between cis common and rare/ultrarare variant estimateshet_genes <-lapply(Map(c, lapply(comp_IVWs_3a, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_4a, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_5a, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_6a, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)})), function(x){unique(x)})het_genes <-mapply(FUN =function(x,y){paste(x,y,sep ="_")}, x = het_genes, y =names(het_genes), SIMPLIFY = F)
Figure 13
Figure 14
Figure 15
Figure 16
Overlap between protein:outcome pairs with significant causal effectes estimated across cis-IV sets
comp_IVWs_7 <-lapply(res_MR, FUN = compare_cistrans, IVset ="sun_gwas_variant_common_clumped",n_snps =3)gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_7, name =names(comp_IVWs_7), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from GWAS common and cis- and trans-pQTL instrumentation strategies")comp_IVWs_8 <-lapply(res_MR, FUN = compare_cistrans, IVset ="dhindsa_exwas_variant_common",n_snps =3)gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_8, name =names(comp_IVWs_8), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from ExWAS common and cis- and trans-pQTL instrumentation strategies")comp_IVWs_9 <-lapply(res_MR, FUN = compare_cistrans, IVset ="dhindsa_exwas_variant_rare",n_snps =3)gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_9, name =names(comp_IVWs_9), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from ExWAS rare and cis- and trans-pQTL instrumentation strategies")comp_IVWs_10 <-lapply(res_MR, FUN = compare_cistrans, IVset ="dhindsa_exwas_variant_ultrarare",n_snps =3)gridExtra::grid.arrange(grobs =mapply(plot_estimates, estimates = comp_IVWs_10, name =names(comp_IVWs_10), SIMPLIFY =FALSE), nrow =2,top ="Comparison of IVW estimates from ExWAS ultra-rare and cis- and trans-pQTL instrumentation strategies")# Extract list of genes with significant heterogenity between cis and trans estimateshet_genes_cistrans <-lapply(Map(c, lapply(comp_IVWs_7, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_8, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_9, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)}),lapply(comp_IVWs_10, function(x){x |> dplyr::filter(Qpval_bon <=0.1) |> dplyr::pull(exposure)})), function(x){unique(x)})het_genes_cistrans <-mapply(FUN =function(x,y){paste(x,y,sep ="_")}, x = het_genes_cistrans , y =names(het_genes_cistrans), SIMPLIFY = F)
Figure 17
Figure 18
Figure 19
Figure 20
Heterogenity in pQTL effects
Restricting to those protein exposures that show signficiant causal effects on a given outcome (at p <= 0.05 with >= 3 IVs), and significant heterogenity between rare and common cis variant IV sets (at Qstat P(FDR) < 0.05):
Code
# Retain exposure-outcome pairs with significant IVW estimate for at least one instrumentation strategy (with n instruments >= n_snps)res_MR_top <-lapply(res_MR, filter_results, p_thresh =0.05, n_snps =3)# Filter to those with evidence of estimate heterogenity between variant setsres_MR_top_het <-mapply(function(x,y){x[names(x) %in% y]}, x = res_MR_top, y = het_genes, SIMPLIFY = F)# Filter to those with evidence of estimate heterogeneity between cis and trans IVsres_MR_top_cistrans_het <-mapply(function(x,y){x[names(x) %in% y]}, x = res_MR_top, y = het_genes_cistrans, SIMPLIFY = F)
Heterogeneity in rare v common cis IV causal estimates
Concordance vs discordance in cis and trans effects
How often do causal estimates for a protein:outcome effect derived from cis- and trans-pQTLs agree (in terms of significance and direction of effect)? [TO ADD]
Consistency in pQTL effects
Restricting to those protein exposures that show signficiant causal effects on a given outcome (at p <= 0.05 with >= 3 IVs) supported by rare and common cis variant IV sets (Qsat P(FDR) < 0.05):
Code
# Extract list of genes with consistent (significant) effects between cis common and rare/ultrarare variant estimatesconsistant_genes <-lapply(Map(c, lapply(comp_IVWs_3a, function(x){x |> dplyr::filter(Qpval_bon >0.1& x[,6] <0.05& x[,7] <0.05&sign(x[,2]) ==sign(x[,3])) |> dplyr::pull(exposure)}),lapply(comp_IVWs_4a, function(x){x |> dplyr::filter(Qpval_bon >0.1& x[,6] <0.05& x[,7] <0.05&sign(x[,2]) ==sign(x[,3])) |> dplyr::pull(exposure)}),lapply(comp_IVWs_5a, function(x){x |> dplyr::filter(Qpval_bon >0.1& x[,6] <0.05& x[,7] <0.05&sign(x[,2]) ==sign(x[,3])) |> dplyr::pull(exposure)}),lapply(comp_IVWs_6a, function(x){x |> dplyr::filter(Qpval_bon >0.1& x[,6] <0.05& x[,7] <0.05&sign(x[,2]) ==sign(x[,3])) |> dplyr::pull(exposure)})), function(x){unique(x)})consistant_genes <-mapply(FUN =function(x,y){paste(x,y,sep ="_")}, x = consistant_genes, y =names(consistant_genes), SIMPLIFY = F)# Filter to significant MR results supported across IV setsres_MR_top_consistant <-mapply(function(x,y){x[names(x) %in% y]}, x = res_MR_top, y = consistant_genes, SIMPLIFY = F)res_IVW_consistant <-lapply(res_MR_top_consistant, function(x){ df <-do.call("rbind", x) |>filter(method =="Inverse variance weighted",grepl("variant",IV_set), cis_trans =="cis") |> dplyr::mutate(instrument_class =factor(variant_IVsets[IV_set], levels =rev(variant_IVsets)))return(df)})
Assessing for monotonically increasing causal effect estimates (Wald Ratios) across rare-variant series of alleles:
- Rare variants (<1% MAF): benign missense variants (BMV) - damaging missense variants (DMVs) - protein truncating variants (PTVs)
Using the COAST-SS approach (developed for allelic series association tests), which has been developed to handle GWAS summary statistics, but here we are passing Wald Ratios (variant outcome effect/ variant protein effect)
What is needed here?:
- Summary statistics for all rare variants (capped at p<0.01 by the AzPheWAS portal for pQTLs) falling in the coding region of genes encoding the O’link plasma proteome assayed proteins
- The corresponding variant associations for outcomes of interest (with which to calculate Wald ratios) - Variant predicted functional effects (with which to categorise SNPs)
Magnitude of variant class effects
Do cis-variants predicted to be more deleterious in nature (loss of function) actually have larger effects on protein expression?
# Load results from COAST-SS allelic series anlysis # 1 = synonymous/noncoding, 2 = benign missense, 3 = deleterious missense, 4 = putative loss of funcionload(file.path(res_dir,"results_molecular_COASTSS_relaxed.rda")) # coastss_annots_relaxed## Filter out exposure genes for which there were insufficient variants for alleleic series analysis and calculate adjusted p-valscoastss_annots_relaxed <-lapply(coastss_annots_relaxed, function(x){ x |> dplyr::filter(!is.na(alleleic_skat_p)) |> dplyr::mutate(adj_p =p.adjust(alleleic_skat_p, method ="BH"))})outcome_names <-c("LDL direct"="LDL","Body mass index (BMI)"="BMI","Vitamin D"="Vitamin D","Triglycerides"="Triglycerides","Glycated haemoglobin (HbA1c)"="HbA1c","Mean platelet (thrombocyte) volume"="PlatVol","IGF-1"="IGF-1","Waist-to-hip ratio adjusted for body mass index (WHRadjBMI)"="WHRadjBMI","Red blood cell (erythrocyte) count"="RBCcount","Mean corpuscular volume"="CorpVol","Systolic blood pressure, automated reading"="SBP")do.call("rbind", coastss_annots_relaxed) |> dplyr::mutate(label =ifelse(adj_p <0.05, exposure, NA),outcome = outcome_names[outcome]) |>ggplot(aes(x = exposure, y =-log10(alleleic_skat_p))) +facet_grid(outcome ~ ., scale ="free_y") +geom_point(aes(colour = adj_p <0.05)) +scale_colour_manual(name ="P(FDR) < 0.05", values =c("grey","tomato")) + ggrepel::geom_text_repel(aes(label = label), size =4, max.overlaps =Inf, force =10) +theme_minimal() +theme(strip.text =element_text(size =10),panel.grid =element_blank(),axis.text.x =element_blank(),axis.title =element_text(size =14),legend.position ="bottom") +labs(x ="Protein exposure", y ="-log10(P) Allelic series")
Dopamine Beta Hydrozylase (DBH) on systolic blood pressure
Code
plot_wald_vep(gene ="DBH", outcome ="SBP")
Gene based analysis
In gene based analysis, variant aggregate approaches are used to test for the effect of carrying any single or combination of rare genetic variants in a gene on a phenotype. Variant masks are often generated based on variant properties (e.g putative loss of function or missense variants), allowing the consequence of varying degrees of gene perturbation to be assessed. Here gene-based masks have been testing for cis and trans associations with plasma protein levels and complex traits, allowing them to be used in an MR framework. It is perhaps most straight forward to handle gene-level associations as single instuments and interpret Wald ratios, although in some case multiple gene-level instruments can exist for a given exposure.
Code
# Extract gene-based burden mask and calculate Wald ratiosres_masks <-lapply(harmonised_snps, function(x){ masks <-lapply(x, function(y){ dat <- y[IV_sets[1:4]] # gene mask results# Remove empty results dfs dat <- dat[unlist(lapply(dat, function(x){all(!(is.na(x$SNP)))}))] dat <-do.call("rbind", lapply(dat, function(z){ z |> dplyr::select(mask = test.exposure, mask_gene = SNP, chr = chr.outcome, pos = pos.outcome, exposure, outcome, beta.exposure, beta.outcome, se.exposure, se.outcome, pval.exposure, pval.outcome, cis_trans) })) })})res_masks <-lapply(res_masks, function(x){ out <-do.call("rbind",x)rownames(out) <-NULLreturn(out)})# Wald ratiosres_masks <-lapply(res_masks, function(x){ wrs <-apply(x, MARGIN =1, FUN =function(x){ TwoSampleMR::mr_wald_ratio(b_exp =as.numeric(x["beta.exposure"]),b_out =as.numeric(x["beta.outcome"]),se_exp =as.numeric(x["se.exposure"]),se_out =as.numeric(x["se.outcome"])) }) wrs_df <-do.call("rbind",lapply(wrs, unlist))colnames(wrs_df) <-paste("wald_ratio", colnames(wrs_df), sep ="_")# Bind to harmonised data out <-cbind(x,wrs_df)return(out)})
Gene-level cis-mask associations
Focusing on aggregate tests combining putative loss of function (protein truncating) and damaging rare varaints, which genes show significant causal effects on outcomes of interest:
Code
res_masks_cis <-lapply(res_masks, function(x){ out <- x |> dplyr::filter(cis_trans =="cis", mask =="ptvraredmg") |> dplyr::mutate(wald_ratio_pval_adj =p.adjust(wald_ratio_pval, method ="BH")) |> dplyr::arrange(wald_ratio_pval_adj)return(out)})# Plot Wald ratios for all gene maks associated with an outcomeplot_wald_masks <-function(results_masks, outcome, pthresh){ dat <- results_masks[[outcome]] |> dplyr::filter(wald_ratio_pval_adj < pthresh) |> dplyr::arrange(wald_ratio_b) |> dplyr::mutate(mask_gene =toupper(mask_gene),mask_gene =factor(mask_gene, levels = mask_gene)) p_wald <-ggplot(dat, aes(x = wald_ratio_b, y = mask_gene)) +geom_vline(xintercept =0, colour ="grey20", lty =2) +geom_point(colour ="steelblue4", size =2) +geom_errorbarh(aes(xmin = wald_ratio_b-1.96*wald_ratio_se, xmax = wald_ratio_b+1.96*wald_ratio_se),colour ="steelblue4",height =0.4) +labs(x ="Wald Ratio (By/Bx)", y ="Burden test gene \n (instrumenting cis-protein expression)", title =paste("Outcome:", outcome),subtitle ="cis-IVs (pLOF and rare damaging)") +theme_bw() +theme(plot.title =element_text(hjust =0.5),plot.subtitle =element_text(hjust =0.5))return(p_wald)}gridExtra::grid.arrange(grobs =lapply(names(res_masks_cis)[-1], plot_wald_masks, results_masks = res_masks_cis, pthresh =0.1),nrow =2)
Figure 54
Cis vs trans gene-level mask associations
Code
res_masks_trans <-lapply(res_masks, function(x){ out <- x |> dplyr::filter(cis_trans =="trans", mask =="ptvraredmg") |> dplyr::mutate(wald_ratio_pval_adj =p.adjust(wald_ratio_pval, method ="BH")) |> dplyr::arrange(wald_ratio_pval_adj)return(out)})# Extract trans IVs for gene showing significant cis-instrumented causal effectscomp_masks_cistrans <-mapply(x = res_masks_cis, y = res_masks_trans, function(x,y){ x <- x[,c("mask", "mask_gene", "exposure", "outcome", "wald_ratio_b", "wald_ratio_se", "wald_ratio_pval", "wald_ratio_pval_adj")] y <- y[,c("mask", "mask_gene", "exposure", "outcome", "wald_ratio_b", "wald_ratio_se", "wald_ratio_pval", "wald_ratio_pval_adj")] join_res <- x |>left_join(y, by =c("exposure", "outcome", "mask"), suffix =c("_cis","_trans"))return(join_res)}, SIMPLIFY = F)# Plot comparison of wald ratios for significant cis or trans IV associations# lapply(comp_masks_cistrans, function(x){# dat <- x |># dplyr::filter(wald_ratio_pval_adj_cis < 0.1 | wald_ratio_pval_adj_trans < 0.1) |># na.exclude()# # dat |> ggplot(aes(x = wald_ratio_b_cis, y = wald_ratio_b_trans)) +# geom_point()# })
Rare copy number variants (CNVs)
Rare copy number variants (CNVs), including gene deletion (or complete ablation of function through loss of function mutations), and duplication (or higher order) events have been identified. As anticipated, these have been shown to have a monotonic effect on protein expression levels; that is, gene deletions reduce measured gene expression, and duplication events increase gene expression, often substantially (Milind and Pritchard et al.). Rare genetic variation with disruptive effects on protein expression, in both directional extremes, present opportunities for protein instrumentation for molecular MR. However, it is of interest whether instrumenting gene expression using deletion and duplication events returns consistent causal estimates for exposure –> outcome effects, or non-monotonicity in genetic effects exist. The latter would suggest divergent pathways through which cis-perturbatuions of a gene may act to influence a phentoype, or potentially indicate that gene expression is not an appropriate measure of the true functional consequences of such perturbations.
Wald ratios have been derived using published gene-level summary statistics (Milind et al.) for the effect of gene deletions, duplications and pLOF variants (burdan masks) on O’link plasma protein levels and a series of 56 complex traits measured in UKB individuals. Note, Wald ratios were derived for all available cis and trans gene masks and CNV events irrespective of whether the event showed a significant protein expression effect. P-value filtered results are imported below.
## Pull out instances of effect heterogeneity between LoF and duplication instrumentshet_cnv_wrs <-do.call("rbind", lapply(comp_WRs_lofdups[sapply(comp_WRs_lofdups, is.null) ==FALSE], function(x){x |> dplyr::filter(Qpval_adj <0.05)}))